Introduction to the tm Package: Text Mining in R
Abstract
This vignette gives a short introduction to text mining in R using the text mining framework provided by the tm package. We present methods for data import, corpus handling, preprocessing, metadata management, and the creation of term-document matrices. Our focus is on the main aspects of getting started with text mining in R. An in-depth description of the text mining infrastructure offered by tm was published in the Journal of Statistical Software (Feinerer et al., 2008), and an introductory article on text mining in R was published in R News (Feinerer, 2008).
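The workflow named in the abstract (import, corpus handling, preprocessing, metadata, term-document matrix) can be sketched in a few lines. This is a minimal illustration, not taken from the vignette itself: the three example sentences and the author name "Jane Doe" are made up, and it assumes the tm package is installed.

```r
library(tm)

# Hypothetical example documents (not from the vignette)
docs <- c("Text mining in R is fun.",
          "The tm package handles corpora.",
          "Mining text needs preprocessing.")

# Data import: build an in-memory corpus from a character vector
corpus <- VCorpus(VectorSource(docs))

# Preprocessing: lower-case, strip punctuation, drop English stopwords,
# and collapse the extra whitespace left behind
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Metadata management: attach an (illustrative) author to the first document
meta(corpus[[1]], "author") <- "Jane Doe"

# Term-document matrix: terms in rows, documents in columns
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

With the default control settings, `TermDocumentMatrix` keeps only terms of at least three characters, so a one-letter token such as "r" does not appear in the matrix.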
Similar resources
Text Mining Infrastructure in R
During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis metho...
Data Mining in R using Rattle
This paper is a brief introduction to the concepts, methods, and algorithms for data mining in the statistical software R using a package named Rattle. Rattle provides a graphical environment for performing some of the procedures and algorithms without the need for programming. Some parts of the package are explained through a number of examples. ...
Topic Models in R
Topic models are a popular method for modeling the term frequency occurrences in documents. The fitted model allows better estimation of the similarity between documents as well as between a set of specified keywords, using an additional layer of latent variables which are referred to as topics. The R package topicmodels provides basic infrastructure for fitting topic models based on data structur...
Nonparametric Distribution Analysis for Text Mining
A number of new algorithms for nonparametric distribution analysis based on Maximum Mean Discrepancy measures have been recently introduced. These novel algorithms operate in Hilbert space and can be used for nonparametric two-sample tests. Coupled with recent advances in string kernels, these methods extend the scope of kernel-based methods in the area of text mining. We review these kernel-ba...
topicmodels: An R Package for Fitting Topic Models
This article is a (slightly) modified and shortened version of Grün and Hornik (2011), published in the Journal of Statistical Software. Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables whi...